Data Analysis of Sample Case Study Data.

Objectives:

1. To understand the context of the data.

2. To understand whether any information can be 
    extracted from the data. 

3. To analyse how this information (if any) 
    can be used to draw meaningful business conclusions  
    using machine learning models.

1. Data Access.

Importing required Libraries.

Load the case study data.

Analyse the anatomy of the data.

Observation:

1. Data has 119390 entries with 22 columns. 
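
The loading and inspection steps above can be sketched as follows. This is a minimal sketch, not the notebook's original code: the file name and delimiter of the case study data are not given here, so a tiny in-memory CSV (made-up rows, reusing three of the column names discussed later) stands in for the real file.

```python
import io
import pandas as pd

# Hypothetical stand-in for the case study file. The real path is not
# stated in this notebook; replace the StringIO object with that path.
sample_csv = io.StringIO(
    "type,canceledFlag,time2Checkin\n"
    "C,0,12\n"
    "R,1,150\n"
    "C,1,90\n"
)
df = pd.read_csv(sample_csv)

# Anatomy of the data: number of entries (rows) and columns.
n_rows, n_cols = df.shape
print(f"{n_rows} entries, {n_cols} columns")

# Column dtypes and non-null counts in one summary.
df.info()
```

On the real data, `df.shape` is where the entry and column counts reported in the observation would come from.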

2. Exploratory Data Analysis.

Gather first-hand information about the data by examining each column in turn.

Column Name : type

Observations:

1. Column 'type' doesn't have any Null or Empty values.
2. It contains two unique categories, 'C' and 'R'.
3. It is assumed that this column represents the type of hotel, for example Resort Hotel (R) and Common Hotel (C).
4. 66.4% of bookings are attempted for 'C' type hotels and 33.6% for 'R' type hotels. 
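
The per-column checks used throughout this section (null or empty values, unique categories, share of each category) could be wrapped in a small helper. The function name `summarize_categorical` and the toy rows are assumptions made for illustration; they are not part of the original notebook.

```python
import pandas as pd

def summarize_categorical(df, column):
    """Return null count, empty-string count, categories and percentage shares."""
    series = df[column]
    return {
        "nulls": int(series.isna().sum()),
        # Empty or whitespace-only strings are counted separately from nulls.
        "empty": int((series.astype(str).str.strip() == "").sum()),
        "categories": sorted(series.dropna().unique().tolist()),
        "share_pct": series.value_counts(normalize=True).mul(100).round(1).to_dict(),
    }

# Toy data for illustration only, not the case study data.
toy = pd.DataFrame({"type": ["C", "C", "R", "C"]})
summary = summarize_categorical(toy, "type")
print(summary)
```

The same helper applies unchanged to the other categorical columns, for example `summarize_categorical(df, "canceledFlag")`.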

Column Name : canceledFlag

Observations:

1. Column 'canceledFlag' doesn't have any Null or Empty values.
2. It contains two unique categories, '0' and '1'.
3. It is assumed that this column captures whether the hotel booking was canceled or not. A value of '0' represents no cancellation and '1' represents cancellation. 
4. 63% of bookings are not canceled and 37% are canceled.


1] The naming convention of the columns gives us an intuition that the data belongs to a hotel booking register.

2] At this point an important decision is made to treat the 'canceledFlag' column as the target.

3] From a business point of view, knowing the probability that a booking will be canceled would be a big edge for hoteliers.

4] So the plan is to analyse the data to understand whether it has any predictive power for the cancellation of a booking.

EDA Continues...

Column Name : time2Checkin

Let us understand the distribution of the 'time2Checkin' column and how this column is correlated with canceledFlag.

Observation:

1. There is a clear difference between the means of the 'Canceled' and 'Not canceled' distributions.
2. The mean time2Checkin of 'Canceled' bookings is 144.
3. The mean time2Checkin of 'Not canceled' bookings is 79.9.
4. This difference in means suggests that time2Checkin may have good predictive power for booking cancellation.
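
The comparison of means above amounts to a group-by on the target column. A minimal sketch on toy rows (the values below are invented and do not reproduce the 144 and 79.9 figures from the real data):

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "canceledFlag": [0, 0, 1, 1],
    "time2Checkin": [10, 30, 100, 140],
})

# Mean time2Checkin for not-canceled (0) and canceled (1) bookings.
mean_by_outcome = toy.groupby("canceledFlag")["time2Checkin"].mean()
print(mean_by_outcome)

# Gap between the two group means.
mean_gap = mean_by_outcome.loc[1] - mean_by_outcome.loc[0]
```

A difference in means alone does not show how much the two distributions overlap, so it is worth reading this number alongside the distribution plot.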

Column Names : arrivalMonth, arrivalWeek & arrivalDay

Observation:

1. There is a clear pattern of bookings peaking in the summer season.
2. Bookings are low in November, December & January.
3. Hotel bookings increase steadily from January to August.

Observation:

1. The difference between the means of 'Canceled' and 'Not canceled' bookings is clear and persistent across all months of the year.

Observation:

1. The number of bookings plotted by arrivalWeek is in line with the plot by arrivalMonth.

Observation:

1. No visual pattern is observed in the relationship between the number of bookings and arrivalDay.
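
The monthly views discussed above can be approximated with two aggregations: bookings counted per arrivalMonth, and a mean split by month and cancellation outcome. This sketch assumes that the compared mean is that of time2Checkin and that arrivalMonth holds month names; neither is confirmed by the text. The rows are toy data.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "arrivalMonth": ["January", "January", "August", "August", "August"],
    "canceledFlag": [0, 1, 0, 1, 1],
    "time2Checkin": [10, 50, 60, 120, 180],
})

# Number of bookings per arrival month.
bookings_per_month = toy["arrivalMonth"].value_counts()

# Mean time2Checkin per month, one column per cancellation outcome.
mean_lead = (
    toy.groupby(["arrivalMonth", "canceledFlag"])["time2Checkin"]
    .mean()
    .unstack("canceledFlag")
)
print(bookings_per_month)
print(mean_lead)
```

Replacing `"arrivalMonth"` with `"arrivalWeek"` or `"arrivalDay"` gives the corresponding weekly and daily views.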

Column Names : numberWeekendnights, numberNights

Observation:

1. The distributions of numberWeekendnights & numberNights seem to be similar for both 'Canceled' and 'Not canceled' bookings.

Column Name : adults, chidren

Observation:

1. The distributions of both adults and children seem to be similar between 'Canceled' and 'Not canceled' bookings.

Column Name : country

Observation:

1. Most of the booking data in the dataset relates to European countries.
2. Bookings are led by Portugal with ~48K bookings, followed by England, France and Spain. 

Column Name : segment

Observation:

1. It is assumed that 'segment' represents the customer segments.
2. It is assumed that 'onl' represents online bookings. This segment dominates with 56.4K bookings.
3. It is followed by the 'off' (assumed offline), 'gro' (assumed group) and 'dir' (assumed direct) customer segments.

Column Name: repeatFlag

Observation:

1. It is assumed that the 'repeatFlag' column captures whether the booking is from a repeat customer or not.
2. It is assumed that '1' represents a repeat customer and '0' represents a new customer.
3. There is a huge skew between repeat and new customers.

Column Name: historicCancellations

Observations:

1. It is assumed that the 'historicCancellations' column captures the cancellation history of the customer.
2. Bookings with no history of cancellation dominate the distribution.
3. These are followed by bookings from customers with one historic cancellation. 

Column Name : historicBookings

Observations:

1. It is assumed that the 'historicBookings' column captures the number of bookings previously made by the customer.
2. This column can be used to engineer a feature that marks whether the booking is from a repeat customer or a new customer.
3. However, the data revealed that ~97% of bookings are from new customers.
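
One possible version of the feature mentioned above is a binary flag that is 1 when the customer has at least one historic booking. The column name `hasPriorBooking` is hypothetical and the rows are toy data; the original notebook may derive the feature differently.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({"historicBookings": [0, 0, 3, 0, 1]})

# Hypothetical engineered feature: 1 if the customer booked before, else 0.
toy["hasPriorBooking"] = (toy["historicBookings"] > 0).astype(int)

# Percentage of bookings that come from customers with no prior booking.
share_new_pct = round((toy["hasPriorBooking"] == 0).mean() * 100, 1)
print(toy)
print(share_new_pct)
```

With a share of new customers as high as the one observed in the real data, such a flag carries little variance, which is worth keeping in mind before relying on it as a predictor.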

Column Name: roomType

Observations:

1. It is assumed that 'roomType' column represents the type of room.
2. Type 'A' dominates the distribution with ~86K bookings.
3. Type 'B' and 'E' follow it with ~19K and 6K bookings respectively.

Column Name : assignedType

Observations:

1. It is assumed that the 'assignedType' column represents the category of room assigned to the customer. Sometimes the assigned room type differs from the booked room type; this could be because of overbooking.
2. Comparing the above plot with the 'roomType' plot, it can be inferred that:
    a. roomType 'A' is the most frequently booked room type.
    b. When roomType 'A' is overbooked, the customer is assigned roomType 'D'.
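
The comparison between booked and assigned room types can also be checked numerically with a cross-tabulation, rather than by comparing two plots by eye. A minimal sketch on toy rows (not the case study data):

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "roomType":     ["A", "A", "A", "B", "A"],
    "assignedType": ["A", "D", "A", "B", "D"],
})

# Rows: booked room type; columns: assigned room type.
reassignment = pd.crosstab(toy["roomType"], toy["assignedType"])

# Fraction of bookings where the assigned type differs from the booked type.
mismatch_rate = (toy["roomType"] != toy["assignedType"]).mean()
print(reassignment)
print(mismatch_rate)
```

Off-diagonal cells show which assigned types absorb each booked type, which is a direct way to test the inference that 'A' bookings are moved to 'D'.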

Column Name : changesFlag

Observations:

1. It is assumed that the 'changesFlag' column captures information on any changes made to the booking.
2. About 85% of the time, people do not make any changes to the booking.

Column Name : deposit

Column Name : waitingDays

Observations:

1. It is assumed that the 'waitingDays' column captures the number of days the booking was kept waiting before confirmation.
2. About 97% of the time, the booking was confirmed on the same day.

Column Name : customerSegment

Observations:

1. It is assumed that the 'customerSegment' column captures information about the customer segment. The meaning of 'C', 'G' & 'T' could not be inferred.
2. Segment 'T' dominates the customer segments with ~96% of bookings.
3. It is followed by 'C' & 'G'.

Column Name : numberofRequests

Observations:

1. It is assumed that the 'numberofRequests' column captures the number of additional requests made by the customer.
2. Around 59% of the time, customers do not make any additional requests.

3. Data Dictionary

With the assumptions made above let us prepare a data dictionary.

#####   type                   : Type of hotel. 
#####   canceledFlag           : Is the booking canceled.  
#####   time2Checkin           : Time between booking and checkin.  
#####   arrivalMonth           : Checkin month while booking.
#####   arrivalWeek            : Checkin week while booking.  
#####   arrivalDay             : Checkin day while booking.
#####   numberWeekendnights    : No of weekend nights booked for.  
#####   numberNights           : No of nights booked.
#####   adults                 : No of adults included for occupancy.  
#####   chidren                : No of children included for occupancy.
#####   country                : Country that hotel belongs to. 
#####   segment                : Customer booking segment.
#####   repeatFlag             : Is the customer an existing one or new.  
#####   historicCancellations  : How many times the customer had cancelled the booking historically.
#####   historicBookings       : How many times the customer did booking previously. 
#####   roomType               : What type of room is booked.
#####   assignedType           : What type of room is assigned. 
#####   changesFlag            : Whether any changes were requested after or while booking. 
#####   deposit                : Deposit type.
#####   waitingDays            : No of days the booking was kept in waiting state before confirmation. 
#####   customerSegment        : Which segment out of 'C','T' & 'G' does customer belong to. 
#####   numberofRequests       : No of additional requests made by customer.

Let us see the correlation plot.

Observations:

1. The 'canceledFlag' column shows a decent correlation with 'time2Checkin' & 'historicCancellations'.
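
The numbers behind a correlation plot like the one above can be read directly from the correlation matrix. A minimal sketch, assuming Pearson correlation over the numeric columns (the notebook does not state which method the plot used) and using toy rows:

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "canceledFlag":          [0, 0, 1, 1, 0, 1],
    "time2Checkin":          [5, 20, 150, 200, 30, 90],
    "historicCancellations": [0, 0, 1, 2, 0, 0],
})

# Pearson correlation of every numeric column with the target, strongest first.
corr_with_target = (
    toy.select_dtypes("number")
    .corr()["canceledFlag"]
    .drop("canceledFlag")
    .sort_values(ascending=False)
)
print(corr_with_target)
```

Pearson correlation against a 0/1 target only captures linear association, so a weak value here does not by itself rule a column out as a useful predictor.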